Matin takhtesangi. irt and cts
THE ROLE OF IRT
Item response theory (IRT) is a collection of statistical models and methods used for two broad purposes in the measurement of health outcomes: item analysis and scale scoring. The family of IRT models describes, in probabilistic terms, the relationship between a person's response to a survey question and his or her standing on the construct (e.g., emotional distress) being measured by the scale. Specifically, IRT models predict the probability of choosing each response category as a function of an underlying, unobserved trait and item parameters. IRT is a powerful tool for use in measurement of candidate ability, test item selection, and test form equating, particularly with respect to applications in computer-based testing. IRT allows us to evaluate examinee ability and to describe how well items on the test are performing. Instead of treating ability solely as a function of an examinee’s score, IRT uses the concept of an Item Characteristic Curve (ICC) to show the relationship between examinee ability and performance on an item. In IRT, ability and item parameters are both estimated based on examinees’ response patterns on the test. The number of item parameters to be estimated determines which IRT statistical model will be used. Although these models involve complex mathematical procedures, the basic concepts are easy to understand. For item analysis, the IRT model characterizes each scale item with a set of properties that describes its ability to discriminate among individuals at different levels along a trait continuum. For scale scoring, IRT uses the full information from a person's responses to each item to estimate their standing on the measured construct. Scale scoring using IRT estimates a score along the continuum of the construct being measured for persons who provide a particular sequence of item responses. 
Usually a person's score estimates include a measure of central tendency and a description of variability that is reported as a standard error of measurement. The IRT scale score may be computed using only the item parameters and the responses of a single individual to any arbitrarily selected set of items, and this is the basis for computer adaptive testing. IRT models come in many varieties, more than 100, and can handle unidimensional as well as multidimensional data, binary and polytomous response data, and ordered as well as unordered response data. The most commonly applied IRT models in health outcomes measurement are the unidimensional parametric family of polytomous-response models, which include the Rating Scale Model, the Partial Credit Model, the Generalized-Partial Credit Model, the Graded Response Model, and the Nominal Model. Each differs in the number of item parameters that are estimated for each scale item and the constraints placed on the model or data. The item parameters define how well an item performs for measuring different levels of the measured construct or trait such as fatigue. The threshold (or difficulty) parameter describes where along the trait continuum an item is most informative for differentiating between lower and higher function levels. The slope parameter describes the strength of an item for discriminating among different levels of the underlying construct. Discrimination is related to precision in that the more an item can discriminate among individuals at different levels of the construct, the more precision the item adds for measuring a person’s trait level. 
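To make the roles of the threshold and slope parameters concrete, here is a minimal sketch of an item characteristic curve. It assumes the two-parameter logistic (2PL) model for a binary item, which is simpler than the polytomous models listed above and is used here only for illustration. The function names and the parameter names `a` (slope) and `b` (threshold) are choices made for this example, not taken from the text.

```python
import math


def icc_2pl(theta, a, b):
    """Probability of a positive response under an assumed 2PL model.

    theta: the person's standing on the trait continuum
    a:     slope (discrimination) parameter of the item
    b:     threshold (difficulty) parameter of the item
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Item information for the 2PL model: a^2 * P * (1 - P).

    Information peaks where theta equals the threshold b and grows
    with the square of the slope a, which is the sense in which a more
    discriminating item adds more precision.
    """
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


if __name__ == "__main__":
    # Compare a weakly and a strongly discriminating item with the same threshold.
    for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
        weak = item_information(theta, a=0.8, b=0.0)
        strong = item_information(theta, a=2.0, b=0.0)
        print(f"theta={theta:+.1f}  info(a=0.8)={weak:.3f}  info(a=2.0)={strong:.3f}")
```

Under these assumptions, an item is most informative for people whose trait level is near its threshold, and raising the slope sharpens the curve around that point.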
CAT (computer adaptive testing)
Computer adaptive testing (CAT) is a method of administering health-related quality-of-life measures by computer using the psychometric framework of Item Response Theory (IRT). These IRT-based adaptive tests are greatly facilitated by a computer because of the computational requirements of the algorithm and the logistics of item and data management. Items are selected on the basis of the patient’s responses to previously administered items. This process uses an algorithm to estimate a person’s score and the score’s reliability and then chooses the best next item, enabling scale administration based on specifications such as content coverage, test length, and standard error. The capacity to rank all patients on the same continuum, even if they have not been given any common items, allows for an assessment that is individually tailored to each person. With item banking, each patient need only answer a subset of items to obtain a measure that accurately estimates what would have been obtained by administering the entire set of items. CAT has been used successfully in educational, licensing, and achievement testing; personality assessment; and selection of military personnel.
Technology in tests
Technology is a powerful mechanism that can transform education. It is not surprising, then, that the second language (L2) testing field adopted "language testing and technology" as the theme for its 2001 annual conference, the Language Testing Research Colloquium (LTRC). Technology is not a novel topic for language testers. The 1985 LTRC was also organized around this theme. Given the recurring theme and the growing use of technology, it is appropriate to reflect on what has transpired in the L2 testing field related to this topic and to situate the developments within the larger language testing, general measurement, and educational contexts. 
The L2 field's first concerted effort in terms of computer-based testing (CBT) emerged in the mid-80s with the 1985 LTRC. The conference proceedings were published under the title Technology and Language Testing (Stansfield, 1986). The proceedings indicate that several papers presented at the conference dealt with CBT and the application of latent trait models to item-bank construction, item selection, and computer adaptive testing (CAT). The general measurement profession had been working with CBT and, more specifically, with CAT since the early 70s. The first conference on CAT was held in 1975. Perhaps the main reason the L2 field has lagged behind in this area is that it has long promoted performance-based assessment, a form of assessment that does not lend itself as readily to computerized administration as do more traditional test formats. In fact, the second section of the Stansfield volume deals primarily with performance-based assessment. So, whereas general measurement researchers, especially those working with CAT, have concerned themselves more with selected-response item types, the L2 field has continued to promote performance-based assessment. Even today, CBT performance-based assessment continues to be a challenge. After Stansfield (1986), the most significant work on language testing and technology is an edited volume by Dunkel (1991). Dunkel's volume is devoted to computer-assisted language learning and CAT. The CAT section documents how knowledge in the L2 field has progressed in the area of CBT. For the most part, the studies continue to explore the value of the adaptive format and report on various CAT developments and validation research efforts. What is most significant about this volume, however, is the variety of CAT applications for assessing L2 proficiency in school and university settings, a clear indication of the growing use of CAT. 
Indeed, since the Dunkel volume, the number of CBT and CAT instruments developed by academic and testing organizations has continued to increase. Many universities and virtually every major language testing organization are engaged in the development of CBT or CAT. Brigham Young University was a pioneer in developing French, German, and Spanish CAT instruments used for placement at universities. The Educational Testing Service in 1998 launched CBT TOEFL in the US and in numerous countries around the world. The University of Cambridge Local Examinations Syndicate (UCLES) also developed CAT instruments in various languages and for various purposes. More specifically, UCLES developed Communicate primarily for language programs in academic settings and BULATS (Business Language Testing Service) for the corporate sector. Languages targeted in these tests include English, French, German, and Spanish. In addition, the Council of Europe has sponsored the DIALANG project, which provides diagnostic assessment in 14 languages. Coupled with this proliferation of CBT and CAT instruments has been a steady flow of publications on the topic. These publications include those by Brown and Iwashita (1996, 1998); Young, Shermis, Brutten, and Perkins (1996); Shermis (1996); Burstein, Frase, Ginther, and Grant (1997); Brown (1997); Dunkel (1997); Chalhoub-Deville, Alcaya, and Lozier (1997); and Chalhoub-Deville and Deville (1999). The most recent book dealing with CAT appeared in 1999. The Chalhoub-Deville (1999) edited volume is devoted entirely to CAT issues, with experts from both the L2 and general measurement fields invited to share their research knowledge and experience in three interrelated areas: the L2 reading construct, L2 CAT applications, and IRT (item response theory) measurement issues. The volume highlights the importance of a construct-driven approach to CAT work. 
This is exemplified most clearly in the discussion chapters that explore and identify links among the various areas to advance more systematic and construct-based CBT and CAT development. Over 15 years have passed since LTRC first focused on technology, and one can argue that great strides have been made in the area of CBT and CAT. The computerized delivery of tests has become an appealing and viable medium for the administration of standardized L2 tests in academic and non-academic institutions. Additionally, a reasonable body of research exists in this area. Given the growing use of and research on CBT, an important issue to consider is the nature of the change that this mode of assessment has introduced to L2 testing. An examination of the changes brought forth by L2 CBT shows that technology has been intended primarily to help make assessment more efficient and serviceable, what Christensen (1997) calls "sustaining" innovations. CBT allows, among other things, more flexible and individualized test administration, tracking of student performance, immediate test feedback, new item/task types, and enhanced test security. Perhaps one of the most exciting capabilities of CBT is the adaptive approach, which one might argue, using Christensen's terminology, is a "disruptive" technology, that is, a technology that changes how we think of and implement our operations. CAT permits the tailoring of item difficulty to the test taker's performance, allowing a more accurate assessment of the examinee's L2 ability. But apart from the adaptive innovation, CBT has been utilized mainly to facilitate test delivery and administration. What is needed, therefore, is to explore how technology can engender fundamental changes in the L2 CBT endeavor. In a forthcoming paper (Chalhoub-Deville, in press), I discuss various issues in areas where L2 CBT can be regarded as a disruptive use of technology. 
Areas covered in this discussion include the representation of the L2 construct, overall test design, item/task construction, and test purpose. In terms of construct representation, it is well documented that the L2 construct is multidimensional and involves a variety of interacting components and processes (Bachman, 1990; Bachman & Palmer, 1996). Language testers need to utilize technology to design measures that increasingly explore and better measure such critical aspects of the construct. Additionally, researchers have argued that some abilities and processes, which are critical for beginning language learners, become less salient for more proficient learners, for whom yet other aspects of the construct begin to emerge (Bernhardt, 1991). Technology provides an excellent capability to trace test takers' language development thus enabling researchers to better understand how aspects of the construct evolve across different ability levels. Technology might also help test developers move beyond conventional test design procedures, which provide scores primarily to rank students, to other procedures that can facilitate a more systematic test design approach, one which creates interrelations among task characteristics, test takers' performances, and inferences about intended underlying abilities and processes. Such an approach would produce a richer and more meaningful depiction of test takers' abilities. An example of such an integrated and construct-based approach to CBT design is Portal (see Mislevy, 1996; Mislevy, Steinberg, Breyer, Almond, & Johnson, 1999). Portal utilizes computer technology and alternative measurement models to examine test takers' performance on test tasks with documented features and provides rich information about underlying language components and processes. 
In a similar vein, technology can also facilitate a more meaningful approach to task development and enable test developers to draw more defensible inferences by establishing a closer link between task creation and underlying abilities. Prototype tasks with identified characteristics, based on a systematic analysis, can be fed into a database used to generate new tasks with the desired linguistic, cognitive, situational, and measurement characteristics. Additionally, computer technology advances now permit the use of more complex tasks in L2 tests. For example, simulation tasks allow test developers to elicit contextualized, integrated performances that closely resemble those in real-life L2 interactions. With the aid of technology, simulations with identified characteristics allow relevant features to be manipulated in a structured manner in order to target intended ability levels. Spolsky (1997) argues that the main purpose of today's L2 standardized tests reflects the "gate keeping" needs that emerged because of the increased educational demands and limited instructional resources experienced at the beginning of the 20th century. Technological advances, however, are fast transforming L2 learning opportunities. The proliferation of distance learning programs will likely result in a decreased need for selection testing and an increased need for assessments that grant credentialing or certification. Similarly, computer-delivered tests that assess and diagnose a person's language and skill development will be in greater demand. In fact, such tests are already available. DIALANG, mentioned above, is an example of this new generation of computer-delivered assessments, which have been developed to meet the needs of learners in non-conventional classroom settings. In conclusion, computer technology has enhanced the efficiency of many of our L2 testing practices and introduced notable innovations such as CAT. 
But most L2 CBT and CAT instruments available on the market fall short in providing any radical transformation of assessment practices. Advances in technology should encourage test developers to move beyond the thinking that has long dominated paper-and-pencil testing and inspire the use of "disruptive" applications, by which assessments are conceptualized and implemented in innovatively different ways.
Web-based language testing
Interest in Web-based testing is growing in the language testing community, as was obvious at recent LTRC conferences, where it was the topic of a symposium on the DIALANG project (Alderson, 2001), a paper (Roever, 2000), several in-progress reports (Malone, Carpenter, Winke, & Kenyon, 2001; Sawaki, 2001; Wang et al., 2000), and poster sessions (Carr, Green, Vongpumivitch, & Xi, 2001; Bachman et al., 2000). Web-based testing is also considered in Douglas's recent book (Douglas, 2000). It is the focus of research projects at UCLA and the University of Hawai'i at Manoa, and a number of online tests for various purposes are available at this time and are listed on Glenn Fulcher's Resources in Language Testing Web site (Fulcher, 2001). This paper is intended to advance the Web-based language testing movement by outlining some of the fundamental theoretical and practical questions associated with its development. Simply defined, a Web-based language test (WBT) is a computer-based language test which is delivered via the World Wide Web (WWW). WBTs share many characteristics of more traditional computer-based tests (CBTs), but using the Web as their delivery medium adds specific advantages while also complicating matters.
COMPUTER-BASED AND WEB-BASED TESTS
The precursors to Web-based language tests (WBTs) are computer-based tests (CBTs; for a recent discussion see Drasgow & Olson-Buchanan, 1999), delivered on an individual computer or a closed network. 
CBTs have been used in second language testing since the early 80s (Brown, 1997), although the use of computers in testing goes back a decade earlier (Chalhoub-Deville & Deville, 1999). Computers as a testing medium attracted the attention of psychometricians because they allow the application of item response theory for delivering adaptive tests (Wainer, 1990), which can often pinpoint a test taker's ability level faster and with greater precision than paper-and-pencil tests. Based on the test taker's responses, the computer selects items of appropriate difficulty, avoiding items that are too difficult or too easy and instead delivering more items at the test taker's level of ability than a non-adaptive test could include. But even for non-adaptive testing, computers as the testing medium feature significant advantages. CBTs can be offered at any time, unlike mass paper-and-pencil administrations, which are constrained by logistical considerations. In addition, CBTs consisting of dichotomously-scored items can provide feedback on the test results immediately upon completion of the test. They can also provide immediate feedback on each test taker's responses -- a characteristic that is very useful for pedagogical purposes. The seamless integration of media enhances the testing process itself, and the tracing of a test taker's every move can provide valuable information about testing processes as part of overall test validation. On the negative side, problems with CBTs include the introduction of construct-irrelevant variance due to test takers' differing familiarity with computers (Kirsch, Jamieson, Taylor, & Eignor, 1998), the high cost of establishing new testing centers, and the possibility of sudden and inexplicable computer breakdowns.
Types of WBTs
A WBT is an assessment instrument that is written in the "language" of the web, HTML. 
The test itself consists of one or several HTML file(s) located on the tester's computer, the server, and downloaded to the test taker's computer, the client. Downloading can occur for the entire test at once, or item by item. The client computer makes use of web-browser software (such as Netscape Navigator or Microsoft Internet Explorer) to interpret and display the downloaded HTML data. Test takers respond to items on their (client) computers and may send their responses back to the server as FORM data, or their responses to dichotomously scored items may be scored clientside by means of a scoring script written in JavaScript. A script can provide immediate feedback, adapt item selection to the test taker's needs, or compute a score to be displayed after completion of the test. The same evaluation process can take place on the server by means of serverside programs. Many different kinds of WBTs are possible, depending on the developer's budget and programming expertise, as well as computer equipment available to test takers. On the low end of the continuum of technological sophistication are tests that run completely clientside and use the server only for retrieving items and storing responses. This type of test is the easiest to build and maintain because it does not require the tester to engage in serverside programming, which tends to involve complex code writing and requires close cooperation with server administrators. In a low-tech WBT, the server only holds the test or the item pool while the selection of the next test item is accomplished by means of a script located clientside. Test-taker responses are either scored clientside or sent to the tester's email box and stored for later downloading. This low-tech approach is preferable if limited amounts of test data can be expected, adaptivity is crude or unnecessary, item pools are small, and testers are interested in remaining independent of computer and software professionals. 
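The scoring of dichotomously scored items mentioned above amounts to comparing each response with a key. The sketch below shows that logic only; it is written in Python for brevity, whereas a real low-tech WBT would run equivalent logic in a clientside script. The function name, the item identifiers, and the dictionary layout are hypothetical choices for this illustration.

```python
def score_responses(responses, answer_key):
    """Score dichotomous items against an answer key.

    responses:  dict mapping item id -> the option the test taker chose
                (items left unanswered are simply absent)
    answer_key: dict mapping item id -> the correct option
    Returns (feedback, total), where feedback maps each item id to
    True/False and total is the number of correct responses.
    """
    feedback = {
        item_id: responses.get(item_id) == correct
        for item_id, correct in answer_key.items()
    }
    total = sum(feedback.values())
    return feedback, total


if __name__ == "__main__":
    key = {"q1": "b", "q2": "a", "q3": "d"}
    answers = {"q1": "b", "q2": "c"}  # q3 was not answered
    per_item, score = score_responses(answers, key)
    print(per_item, f"{score}/{len(key)}")
```

The per-item dictionary supports immediate feedback on each response, while the total supports a score displayed after completion of the test.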
A high-tech WBT, on the other hand, makes heavy use of the server, for example, by having the server handle item selection through adaptive algorithms or by placing a database program on the server to collect and analyze test-taker responses. Both tasks require testers to become highly familiar with the relevant software or involve computer specialists in test setup and maintenance. This high-tech approach is preferable in cases where large amounts of test data have to be handled, complex adaptive algorithms are used, item banks are large, and budgets allow for the purchase of expensive software and the hiring of computer professionals. In this paper, I will focus on the low-tech versions of Web-based tests, which give testers maximum control over test design, require very small operating budgets, and make the advantages of computer-based testing available to testers at many institutions.
What to Test on the Web and How to Test It
The first step in any language testing effort is a definition of the construct to be tested. Will the test results allow inferences about aspects of students' overall second language competence in speaking, reading, listening, and writing (Bachman, 1990; Bachman & Palmer, 1996)? Or will the test directly examine their performance on second language tasks from a pre-defined domain (McNamara, 1996; Norris, Hudson, Brown, & Yoshioka, 1998; Shohamy, 1992, 1995), such as leaving a message for a business partner, writing an abstract, or giving a closing argument in a courtroom? Whether a test focuses on aspects of second language competence or performance, its construct validity is the overriding concern in its development and validation. To that end, the test developer must be able to detect sources of construct-irrelevant variance and assess whether the construct is adequately represented, in addition to considering the test's relevance, value implications, and social consequences (Messick, 1989). 
Also, they must examine the test's reliability, authenticity, interactiveness, impact, and practicality (Bachman & Palmer, 1996). In the following section, appropriate content and item types for WBTs will be discussed and some WBT-specific validation challenges briefly described.
Item Types in WBTs
The Web is not automatically more suited for the testing of general second language competence or subject-specific second language performance than are other testing mediums. To the extent that the performance to be tested involves the Web itself (e.g., writing email, filling in forms), performance testing on the Web is highly authentic and very easy to do since testers only have to create an online environment that resembles the target one. However, a WBT or any computer-based test can never truly simulate situations like "dinner at the swanky Italian bistro" (Norris et al., 1998, pp. 110-112). Rather than analyzing the possibilities of Web-based testing primarily along the lines of the competence-performance distinction, it is more useful to consider which item types are more and which ones are less appropriate for Web-based testing. It is fairly easy to implement discrete-point grammar and vocabulary tests using radio buttons to create multiple-choice items, cloze tests and C-tests with text fields for brief-response items, discourse completion tests or essays with large text areas, as well as reading comprehension tests with frames, where one frame displays the text and the other frame displays multiple-choice or brief-response questions. If the test items are dichotomous, they can be scored automatically with a scoring script. Such items can be contextualized with images (but see Gruba, 2000, for some caveats). They can also include sound and video files, although the latter are problematic: These files are often rather large, which can lead to unacceptably long download times, and they require an external player, a plug-in, which is beyond the tester's control. 
This plug-in allows test takers to play a soundfile repeatedly simply by clicking the plug-in's "Play" button. Probably the most serious drawback of WBTs in terms of item types is that, at this time, there is no easy way to record test-taker speech. Microphones are of course available for computers with soundcards, but recording and sending a sound file requires so much work on the part of the test taker that the error potential is unacceptably large.
Validation of WBTs
Quantitative and qualitative validation of WBTs does not differ in principle from validation of other types of tests. This is described in detail by Messick (1989) and Chapelle (1998, 1999). However, there are specific validity issues introduced by the testing medium that deserve attention in any WBT validation effort.
Computer familiarity. It is well established that test takers' varying familiarity with computers can influence their scores and introduce construct-irrelevant variance (Kirsch et al., 1998). Tutorials to increase computer familiarity can eliminate this effect (Taylor, Jamieson, Eignor, & Kirsch, 1998) and the use of standard web-browsers in WBTs increases the likelihood that test takers are already acquainted with the testing environment. For example, Roever (2001) found no significant correlation between self-assessments of Web browser familiarity and scores on a Web-based test of second language pragmatics taken by 61 intermediate-level English as a second language (ESL) learners in the English Language Institute at the University of Hawai'i: Browser familiarity only accounted for 1%-3% of the variance in scores.
Typing speed. Differences in test takers' typing speed are potentially more serious sources of error variance and are not amenable to quick training. In oral debriefings, test takers in the Roever (2001) study complained about having too little time for the discourse completion section of the test, which required typing brief utterances and allowed 90 seconds per item. 
On average, test takers completed 83% of the brief response section, whereas they completed 99% of each of the test's two multiple-choice sections, in which they were allotted 60 seconds per item. Although a simple time increase for brief response items seems like an obvious option, the fact that no member of the native-speaker (NS) comparison group had the same problem raises the question of whether and how typing speed and second language proficiency are related.
Delivery failures and speededness. One issue in the development phase of a Web-based test is to ensure that the test does not "skip" items during delivery due to technical problems. This can happen if the test taker accidentally double-clicks instead of single-clicking a button, or if there are errors in the algorithm that selects the next item in an adaptive or randomized test. It can be difficult to "tease apart" whether an item was not answered because the test taker ran out of time or because the computer did not deliver the item.
Loading time and timer. If the test is not delivered clientside but via the Web, download times can be negligible or considerable, depending on server traffic, complexity of the page, client computer speed, and a host of other factors beyond the test designer's control. It is therefore important for timed tests to stop the timer during downloads and restart it when the page is fully displayed.
A Special Case: CATs on the Web
Computer-adaptive tests are possible on the Web and do not pose many technical problems beyond those encountered in linear tests, but it cannot be emphasized enough that the design of a sophisticated CAT is a very complex undertaking that requires considerable expertise in item response theory. Issues in designing and implementing CATs in second language assessment contexts have been discussed at length elsewhere (Chalhoub-Deville & Deville, 1999; Dunkel, 1999), so the following will only discuss issues specific to Web-adaptive tests (WATs). 
Like general WBTs, CATs and WATs can be designed at various levels of sophistication. A very simple WAT could display sets of items of increasing difficulty and break off when a test taker scores less than 50% on a set. The test taker's ability would then roughly lie between the difficulty of the final and the preceding set. This is fairly easy to realize on the Web, since all that is required is a count of the number of correct responses. However, such a test does not save much time for high-ability test takers, who would have to proceed through most difficulty levels. So instead of starting at the lowest difficulty level, initial items could be of mid-difficulty. Subsequent sets would be more or less difficult depending on a test taker's score until the 50% correctness criterion is met. On the sophisticated end of CATs, complex algorithms re-compute ability estimates after every test taker response and select the best next item from a large item pool. Even these algorithms can run clientside, determine which item parameters are desirable for the next item, select an item from a list, and request that item from the server. This does not address the issue of item exposure, which is a major consideration in the item selection process since it potentially compromises test security: An overexposed item could be reconstructed by test takers after the test and communicated to others. However, this is hardly a concern for WATs, which are most appropriate for low-stakes situations (discussed later in this article). In the event that a WAT used in a medium- or high-stakes situation necessitates exposure control, the simplest way of limiting exposure is by means of a randomization function, which selects an item from a pool of equivalent items with the same parameters (for a more complex approach, see Stocking & Lewis, 1995). 
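As a rough illustration of the very simple WAT and of randomization-based exposure control, the sketch below steps through item sets of increasing difficulty, draws each administered item at random from a pool of equivalent items, and breaks off when fewer than 50% of a set is answered correctly. It covers only the variant that starts at the lowest level, not the mid-difficulty start. The data layout, the function name, and the `is_correct` callback (a stand-in for presenting an item and scoring the response) are hypothetical, and each level is assumed to contain at least one slot.

```python
import random


def run_simple_wat(levels, is_correct, rng=None, pass_mark=0.5):
    """Administer item sets of increasing difficulty until one is failed.

    levels:     list ordered from easiest to hardest; each level is a list
                of slots, and each slot is a list of equivalent items
    is_correct: callable(item) -> bool, hypothetical stand-in for showing
                the item and scoring the test taker's response
    rng:        optional random.Random instance (useful for repeatable runs)
    Returns (last_passed, failed): the index of the last level passed
    (-1 if none) and the index of the level that was failed (None if every
    level was passed). Ability lies roughly between these two levels.
    """
    rng = rng or random.Random()
    last_passed = -1
    for level_index, slots in enumerate(levels):
        # Exposure control: pick one item at random from each pool of equivalents.
        administered = [rng.choice(slot) for slot in slots]
        n_correct = sum(1 for item in administered if is_correct(item))
        if n_correct / len(administered) < pass_mark:
            return last_passed, level_index
        last_passed = level_index
    return last_passed, None


if __name__ == "__main__":
    # Items are (difficulty, label) pairs; two slots per level, two variants per slot.
    levels = [
        [[(d, f"L{d}-s{s}-v{v}") for v in range(2)] for s in range(2)]
        for d in (1, 2, 3, 4)
    ]
    # Simulated test taker who answers everything up to difficulty 2 correctly.
    print(run_simple_wat(levels, lambda item: item[0] <= 2, rng=random.Random(0)))
```

Because each slot needs several interchangeable items, the pool grows with the number of variants per slot even in this small sketch.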
However, this means that the item bank has to be quite large: Stocking (1994) recommends an item bank that is 12 times the test's length; Stahl and Lunz (1993) content themselves with 8-10 times.
WHY WBTs IF WE ALREADY HAVE CBTs?
Low-tech WBTs offer advantages over traditional CBTs with regard to their practicality (Bachman & Palmer, 1996), logistics, design, cost, and convenience.
"Anyplace, Anytime": The Asynchrony Principle
Probably the single biggest logistical advantage of a WBT is its flexibility in time and space. All that is required to take a WBT is a computer with a Web browser and an Internet connection (or the test on disk). Test takers can take the WBT whenever and wherever it is convenient, and test designers can share their test with colleagues all over the world and receive feedback. The use of scoring scripts for dichotomously-scored items can make the test completely independent of the tester and increases flexibility and convenience for test takers even further. An important caveat is called for here, which will be elaborated further in the section on stakes. In high-stakes situations, where test takers stand to gain an advantage by cheating, uncontrolled and unsupervised access is not feasible. In such cases, monitored and supervised testing facilities must be used, where the degree of supervision and standardization of the physical environment again depends on the stakes involved. Even if high stakes are involved, there are still advantages to delivering the test via the Web: no specialized software is necessary, and existing facilities like computer labs can be used as testing centers. However, the convenience of "any place, any time" access no longer holds.
"Testing Goes Grassroots"
Whereas producing traditional CBTs requires a high degree of programming expertise and the use of specially-designed and non-portable delivery platforms, WBTs are comparatively easy to write and require only a free, standard browser for their display. 
In fact, anybody with a computer and an introductory HTML handbook can write a WBT without too much effort, and anybody with a computer and a browser can take the test; language testers do not have to be computer programmers to write a WBT. This is largely due to HTML's not being a true programming language but only a set of formatting commands, which instruct the client's Web browser how to display content. In addition, HTML contains elements that support the construction of common item types, such as radio buttons for multiple-choice items, input boxes for short response items, and text areas for extended response items (essays or dictations). Free or low-cost editing programs are available that further aid test design. Of course, just because it is easy to write WBTs does not mean that it is easy to write good WBTs. Pretty pictures and animated images do not define test quality, and any test design and implementation must follow sound procedures (Alderson, Clapham & Wall, 1995) and include careful validation.
Testing Goes Affordable
A WBT is very inexpensive for all parties concerned. Testers can write the test by hand or with a free editor program without incurring any production costs except the time it takes to write the test. Once a test is written, it can be uploaded to a server provided by the tester's institution or to one of many commercial servers that offer several megabytes of free web space. Since WBTs tend to be small files of no more than a few kilobytes, space on a free server is usually more than sufficient for a test. The use of images, sound, or video can enlarge the test considerably, however, and may require the simultaneous use of several servers or the purchase of more space.
For the test taker, the only expenses incurred are phone charges and charges for online time, but since many phone companies in the US offer flat rates for unlimited local calls and many Internet service providers have similar flat rate plans for unlimited web access, test takers may not incur any extra costs for a testing session. However, the situation can be markedly different outside North America, where phone companies still charge by the minute for local calls. In such cases, a version of the test that can be completed entirely offline should be provided and distributed via email or download.
ISSUES AND LIMITATIONS OF USING WBTs
The following are some issues that should be considered during the conceptualization and the early stages of WBT development.
Cheating and Item Exposure
The greatest limitation of WBTs is their lack of security with respect to cheating and item confidentiality. Obviously, any test that test takers can take without supervision is susceptible to cheating. It is impossible to ensure that nobody but the test taker is present at the testing session, or that it is even the test taker who is answering the test questions. That limits the possible applications of unsupervised WBTs to low-stakes testing situations. Item confidentiality is also impossible to maintain, since test takers are not taking the test under controlled conditions, that is, they could just copy items off the screen. Also, items are downloaded into the web browser's cache on the test taker's computer, which means that they are temporarily stored on the test taker's hard drive, where they can be accessed. This is not a problem if items are created "on the fly" or if the item pool is constantly refreshed and each item is only used a few times. Of course, cheating and item confidentiality are less relevant to low-stakes situations and can be prevented if the test is taken under supervision.
This reduces the "anyplace, anytime" advantage of a Web-based test, but it may be a viable option for medium-stakes tests or tests taken by only a few test takers, where the establishment of permanent testing centers would not be cost-effective and trustworthy supervisors can be found easily at appropriate facilities.
Self-Scoring Tests and Scripts
Using JavaScript to make tests self-scoring is an attractive approach because it can save a great deal of tedious scoring work, but there is a potential problem associated with this scoring approach: The script contains all the answers. In other words, the answers to all items are downloaded to the test taker's computer, where a techno-savvy test taker can easily view them by looking at the test's source code. This can be made a bit more difficult by not integrating the script in the HTML code but instead embedding it as a separate script file, but with a little searching, even that can be found in the browser cache. Solutions to this problem are supervision, scoring by the tester (e.g., by means of SPSS syntax), or serverside scoring scripts, which would have to be written in Java, Perl, or serverside JavaScript.
Data Storage
Requirements for secure data storage differ by the type and purpose of the WBT. If the test is taken clientside only, for example, as a self-assessment instrument without any involvement of the tester, test-taker entries should be stored for the duration of the test so that a browser crash does not wipe out a test taker's work (and score) up to that point. However, as a security feature, Web browsers are generally prevented from writing to the test taker's hard disk. The only file to which they can write is a cookie file (cookie.txt on PC, cookie on Mac), and the main content that can be written to each individual cookie is one string of up to 2,000 characters (about two double-spaced pages).
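As a rough sketch of such a backup, test-taker entries could be packed into one string and checked against the size limit before being written to a cookie. The function names and the "|" separator are assumptions made for this illustration, and the 2,000-character figure is simply the one given in the text; actual browser limits may differ.

```javascript
// Limit taken from the text above; real browsers may differ.
const MAX_COOKIE_CHARS = 2000;

// Pack responses (strings or numbers) into one string.
// encodeURIComponent escapes characters such as ";", "=", and "|" that
// would otherwise clash with cookie syntax or with the separator.
function packResponses(responses) {
  return responses.map((r) => encodeURIComponent(String(r))).join("|");
}

// Reverse of packResponses; every value comes back as a string.
function unpackResponses(packed) {
  if (packed === "") return [];
  return packed.split("|").map((part) => decodeURIComponent(part));
}

// Return the packed string if it fits the limit, otherwise null so the
// caller can fall back to another backup route.
function packForCookie(responses, maxChars = MAX_COOKIE_CHARS) {
  const packed = packResponses(responses);
  return packed.length <= maxChars ? packed : null;
}
```

In a browser, the returned string would be assigned to `document.cookie` together with a cookie name. The length check runs first, since a single cookie holds only about 2,000 characters.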
This may not be enough to save a long essay, but plenty to save numerical responses, short answers, and biodata. A problem here is that cookies as a means of data backup work only in Microsoft Internet Explorer, which updates the cookie physically on the hard drive every time it is modified. Netscape Navigator holds the cookie in memory and only updates it when the browser window is closed, so that a system crash in the middle of a testing session irretrievably erases the cookie. If the test involves the tester, that is, if test data are sent back to the tester's server, secure data storage is somewhat easier. The response to every item can be sent as a FORM email, so that a reconstruction of test-taker responses is possible even after a browser or system crash. As an additional security feature to guard against server problems, sets of responses can be "harvested" by a JavaScript function and sent to a different server, so that in fact two or several records of each testing session exist.
From Test to Spreadsheet: Think Backwards
If complex serverside scripting or manual data entry of test-taker responses into a spreadsheet is to be avoided, the most convenient way of transferring responses is simply having test-taker responses to the entire test transferred at the same time in one final FORM email as a single long string. Testers then edit their email file (after saving it under a different name) so that it consists of nothing but those response strings (e.g., by making all the response strings bold and subsequently deleting everything that is not bold), which can be read into a spreadsheet as raw, unformatted text (ASCII data). It is important to think backwards in this design process, that is, start out by considering the requirements and limitations that the spreadsheet or data analysis programs impose.
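One way to think backwards is to fix the delimiter first and clean every entry before it goes into the response string. The following is a minimal sketch, assuming a comma-delimited target format; the function names are hypothetical.

```javascript
// Remove the characters that the analysis program would read as delimiters
// (commas and whitespace here), so each entry stays a single data point.
function stripDelimiters(entry) {
  return String(entry).replace(/[,\s]/g, "");
}

// Join all cleaned entries into the single long string that is sent in the
// final FORM email. The comma delimiter is an assumption for this sketch.
function buildResponseString(entries, delimiter = ",") {
  return entries.map((entry) => stripDelimiters(entry)).join(delimiter);
}
```

Both the delimiter and the characters to strip follow from what the receiving program expects.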
For example, SPSS delimits data points by means of commas or spaces, which means that they should also be thus delimited in the response strings, and that all other commas and spaces have to be eliminated. Scripts should be devised to check test-taker input for commas and spaces and remove them; for example, a test taker's entry of "Doe, John" should become "DoeJohn."
Server Failure and Browser Incompatibility
A variety of technical problems is possible in Web-based testing, but the most significant ones are server failure and browser incompatibilities. Server failure means that the server which houses the test is "down," so that test takers cannot access the test or continue a testing session where items are downloaded one by one. A simple way around this problem is to have "mirror sites" on alternate servers. Alternatively, all items can be downloaded at the beginning of the testing session as part of a script and can then be retrieved clientside. A client-related problem that can be a minor or major bother is incompatibility of HTML or script features with the browser used clientside. The two major Web browsers, Netscape Navigator and Microsoft Internet Explorer, function similarly but not identically, so that the same test may work as desired on one but not the other. Even more importantly, different generations of browsers can be quite different in the kind of scripting that they can handle. The easiest way to tackle the compatibility problem is to ensure that all test takers have exactly the same browser and browser version. In that case, testers need to write and pilot the test only for that specific browser. If that is not possible, the next best solution is to offer a standard version of the test with scripting and an alternative, no-frills (no-scripts, no-frames) version that runs on any browser.
WEB-BASED TESTING OR NOT? THE CASE FOR A STAKES-DRIVEN DECISION
Whether Web-based testing is appropriate for a given assessment purpose depends largely on the consequences of the test. Generally speaking, the lower the stakes involved, the more appropriate a WBT.
Low-Stakes Assessment
WBTs are particularly appropriate for any assessment in the service of learning, where assessment serves to give learners feedback on their performance and provides them with a gauge of how close they are to reaching a pre-specified learning goal (for an overview of the beneficial effects of assessment on learning, cf. Dempster, 1997). Such assessment can accompany classroom instruction or it can be a component of a web-based instruction system or a test-preparation system. Learners have little or no incentive to cheat on this type of assessment instrument since cheating would not be in their best interest. A second highly appropriate use of low-tech WBTs is for second language research and specifically, research on language tests. The great flexibility of WBTs lets research participants work on the instrument wherever and whenever is convenient for them, and test developers can use scripts to record participants' every move precisely, thereby gathering valuable information on item characteristics and appropriate degree of speededness. Finally, self-scoring instruments on the Web can be used for test preparation, either for large standardized tests or as pre-placement tests for students preparing to enroll in a foreign language program or a university in a foreign country. Such pre-placement will give test takers a general notion of how they will perform on the test in question so that they can decide whether additional preparation is needed. Using a WBT for low-stakes assessment preserves all the Web advantages of this test type: Test takers can take the test in the privacy of their own homes, at a time of their choice, and at their own pace.
Costs for designing and maintaining the test on the Web are low to non-existent.
Medium-Stakes Assessment
Assessment situations with medium stakes include placement tests for foreign students, midterm or final exams in classes, and other assessment situations which affect learners' lives but do not have broad, life-altering consequences. In these testing situations, test takers have an incentive to cheat, so unsupervised use of WBTs is not indicated. The test has to be administered at a trustworthy testing site; for example, in the case of a placement test for an English Language Institute at a US university, the test can be administered in the university's own computer lab under supervision of a lab monitor. Even more conveniently for students and testers, the test can be administered at a trusted remote site (e.g., another university's computer lab) before students even enter the program, thereby allowing them to register for courses in advance, and giving administrators an early overview of how many courses will be needed. Another situation with medium stakes involves assessment for course credit. Distance education courses and classes taught via the Internet spring to mind because Web-based assessment will allow geographically dispersed test takers to take the test at a site near them. Supervised testing reduces the "anytime, anyplace" advantage of Web-based testing, but the information value of early placement for all stakeholders may often balance this loss.
High-Stakes Assessment
High-stakes assessment is any assessment whose outcome has life-changing implications for the test taker. Admission tests for universities or other professional programs, certification exams, or citizenship tests are all high-stakes assessment situations. Obviously, such assessment requires tight security, standardized testing environments, and the most precise and informative testing methods.
Even high-stakes assessment instruments can be realized as WBTs, and such an approach can greatly increase test availability and reduce testing expenses for testers and test takers. But these situations clearly require involvement of computer experts to make test delivery glitch-free and keep the item pool hacker-proof. Generally, at this time, the author would not recommend using the Web for high-stakes testing, which is better done on closed and secure intranets.
THE FUTURE OF WEB-BASED LANGUAGE TESTING
It may seem premature to talk about the future when Web-based language testing is only now beginning to emerge as an approach to testing. However, some central issues that will have to be dealt with can already be identified:
* validation procedures for different types of media use, different types of delivery platforms, and the equivalency of test-taking in different environments,
* the potential, limits, and most appropriate uses of low-tech WBTs and high-tech WBTs,
* oral testing over the web, as real-time one-on-one voice chat or computer-generated speech,
* the possibilities of virtual reality for near-perfect task authenticity and performance-based testing.